Library Imports
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.sql import functions as F
from datetime import datetime
from decimal import Decimal
Template
spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.6 - Filtering Data")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)
sc = spark.sparkContext
import os
data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path
pets = spark.read.csv(path, header=True)
pets.toPandas()
|    | id | breed_id | nickname | birthday | age | color |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown | 
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None | 
| 2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None | 
Filtering Data
Filtering out unwanted rows is another commonly used operation in data analysis.
Option 1 - where()
(
    pets
    .where(F.col('breed_id') == 1)
    .filter(F.col('color') == 'brown')
    .toPandas()
)
|    | id | breed_id | nickname | birthday | age | color |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown | 
What Happened?
Similar to the functions we have seen so far, there are multiple functions aliased to different names that perform the same transformation; `where()` and `filter()` are interchangeable. IMO I prefer `where()` as it's a bit more intuitive and closer to the SQL syntax.
Note:
Notice how we don't have to wrap `1` or `'brown'` in `F.lit()`; because the left-hand side is already a column expression, the bare Python values are automatically treated as literal columns.
We will look at how to express more complex filters that combine more than one condition in 2.1.7.
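As a quick illustration of both notes, here is a minimal sketch (the variable names are just for illustration) showing that `filter()` is an alias of `where()`, and that wrapping the literals explicitly in `F.lit()` gives the same result:

# Minimal sketch: all three produce the same rows on the `pets` DataFrame
# loaded above, since filter() is an alias of where() and bare Python
# values are wrapped into literal columns for you.
brown_breed_1_a = pets.where(F.col('breed_id') == 1).filter(F.col('color') == 'brown')
brown_breed_1_b = pets.filter(F.col('breed_id') == 1).where(F.col('color') == 'brown')
brown_breed_1_c = pets.where(F.col('breed_id') == F.lit(1)).where(F.col('color') == F.lit('brown'))

assert brown_breed_1_a.count() == brown_breed_1_b.count() == brown_breed_1_c.count()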
Option 2 - isin()
(
    pets
    .where(F.col('nickname').isin('King', 'Argus'))
    .toPandas()
)
|    | id | breed_id | nickname | birthday | age | color |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown | 
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None | 
What Happened?
If you want to check whether a column's value is one of many possible values, you can use the `isin()` function. This function accepts either comma-separated values or a single list of values. This is again very similar to the SQL syntax.
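To illustrate, here is a minimal sketch (the variable names are just for illustration) of the two equivalent ways of calling `isin()`, with comma-separated values and with a single list:

# Minimal sketch: isin() accepts either comma-separated values or a list,
# so both filters keep the same rows.
by_varargs = pets.where(F.col('nickname').isin('King', 'Argus'))
by_list = pets.where(F.col('nickname').isin(['King', 'Argus']))

assert by_varargs.count() == by_list.count()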
Summary
- We learnt about two filtering functions in Spark, `where()` and `isin()`.
- Using `isin()` you can check whether a column contains one of multiple values.
- These functions are named similarly to their SQL counterparts.